Predict Customer's Income Range

Problem Statement

In this case study, we are going to build a classifier to calculate the probability of a customer Income Range.

Dataset Description:

Each row is about a customer. We have details about their relationship, occupation, education, marital status etc. In class column (target column), we have value <=50K and >=50K,

Importing the important Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, auc, roc_curve
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from statsmodels.formula.api import ols
from sklearn.tree import DecisionTreeRegressor
In [2]:
dt=pd.read_csv('adult.csv')
In [3]:
dt.head()
Out[3]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country class
0 2 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 1 0 2 United-States <=50K
1 3 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 0 United-States <=50K
2 2 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 2 United-States <=50K
3 3 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 2 United-States <=50K
4 1 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 2 Cuba <=50K
In [4]:
dt.columns
Out[4]:
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capitalgain', 'capitalloss', 'hoursperweek', 'native-country',
       'class'],
      dtype='object')
In [5]:
dt.dtypes
Out[5]:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capitalgain        int64
capitalloss        int64
hoursperweek       int64
native-country    object
class             object
dtype: object
In [6]:
dt.isnull().sum()
Out[6]:
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capitalgain       0
capitalloss       0
hoursperweek      0
native-country    0
class             0
dtype: int64
In [7]:
dt['class'].unique()
Out[7]:
array(['<=50K', '>50K'], dtype=object)
In [8]:
dt.shape
Out[8]:
(48842, 15)
In [9]:
dt['workclass'].unique()
Out[9]:
array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)
In [10]:
dt['workclass']=dt['workclass'].replace({'?':'Unknown'})
In [11]:
dt.workclass.mode()
Out[11]:
0    Private
dtype: object
In [20]:
dt.workclass.value_counts()
Out[20]:
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
Unknown              2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: workclass, dtype: int64

Feature-Engineering

In [21]:
g = sns.factorplot(x='workclass',y='capitalgain',hue='class',data=dt,kind='bar',size=6,palette='muted')

g.despine(left=True)

g = g.set_ylabels('Capital Gain')
plt.xticks(rotation = 45)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:3666: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:3672: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[21]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]), <a list of 9 Text xticklabel objects>)

Observation

  • Work class and Capital gain is shown in this plot w.r.t Class
  • We can clearly see that Self employed class is the highest contributor of capital gain
In [12]:
Price_month = dt['workclass'].value_counts()
months = pd.DataFrame(data=Price_month.index, columns=["workclass"])
months['values'] = Price_month.values
In [13]:
fig = px.pie(months, values='values', names='workclass', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

Observation

  • The pie chart shows percentage of different work classes
  • We can clearly see private job has the highest percentage
In [14]:
def histogram(data,path,color,title,xaxis,yaxis):
    fig = px.histogram(data, x=path,color=color)
    fig.update_layout(
        title_text=title,
        xaxis_title_text=xaxis, 
        yaxis_title_text=yaxis, 
        bargap=0.2, 
        bargroupgap=0.1
    )
    fig.show()
In [15]:
histogram(dt,"workclass","class",'class on workclass','workclass','Count')

Observation

  • Among all the Work class Private work class has the highest count w.r.t class
In [16]:
df_edu=dt.groupby('class')['workclass'].value_counts(normalize=True)
df_edu = df_edu.mul(100).rename('Percent').reset_index()
df_edu['Percent']=df_edu['Percent'].round(decimals=2)
df_edu.head(10)
Out[16]:
class workclass Percent
0 <=50K Private 71.37
1 <=50K Self-emp-not-inc 7.50
2 <=50K Unknown 6.82
3 <=50K Local-gov 5.95
4 <=50K State-gov 3.91
5 <=50K Federal-gov 2.34
6 <=50K Self-emp-inc 2.04
7 <=50K Without-pay 0.05
8 <=50K Never-worked 0.03
9 >50K Private 63.21
In [17]:
px.bar(df_edu, x='class', y='Percent', color='workclass', title="class as yes w.r.t workclass",
                    barmode='group', text='Percent')

Observation

  • Among all the Work class Private work class has the highest percentage.
  • 63% of private work class earns more than 50K
In [18]:
dt_edu = dt['education'].value_counts()
education = pd.DataFrame(data=dt_edu.index, columns=["education"])
education['values'] = dt_edu.values
In [19]:
fig = px.pie(education, values='values', names='education', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

Observation

  • For Education, HS-grad has the highest count
In [20]:
histogram(dt,"education","class",'class on education','education','Count')
In [21]:
df_edu=dt.groupby('class')['education'].value_counts(normalize=True)
df_edu = df_edu.mul(100).rename('Percent').reset_index()
df_edu['Percent']=df_edu['Percent'].round(decimals=2)
df_edu.head(10)
Out[21]:
class education Percent
0 <=50K HS-grad 35.74
1 <=50K Some-college 23.72
2 <=50K Bachelors 12.68
3 <=50K 11th 4.63
4 <=50K Assoc-voc 4.14
5 <=50K 10th 3.50
6 <=50K Masters 3.22
7 <=50K Assoc-acdm 3.20
8 <=50K 7th-8th 2.40
9 <=50K 9th 1.92
In [22]:
px.bar(df_edu, x='class', y='Percent', color='education', title="class  w.r.t education",
                    barmode='group', text='Percent')

Observation

  • People with Hs-grad has the highest count percentage w.r.t class as <=50k
  • People with Bachelors has the highest count percentage w.r.t class as >50k
In [23]:
dt_marr = dt['marital-status'].value_counts()
marriage = pd.DataFrame(data=dt_marr.index, columns=["marital-status"])
marriage['values'] = dt_marr.values
In [24]:
fig = px.pie(marriage, values='values', names='marital-status', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

Observation

  • For Marital status, Married-civ-spouse has the highest count
In [25]:
histogram(dt,"marital-status","class",'class on marital-status','marital-status','Count')
In [26]:
df_married=dt.groupby('class')['marital-status'].value_counts(normalize=True)
df_married = df_married.mul(100).rename('Percent').reset_index()
df_married['Percent']=df_edu['Percent'].round(decimals=2)
df_married.head(10)
Out[26]:
class marital-status Percent
0 <=50K Never-married 35.74
1 <=50K Married-civ-spouse 23.72
2 <=50K Divorced 12.68
3 <=50K Separated 4.63
4 <=50K Widowed 4.14
5 <=50K Married-spouse-absent 3.50
6 <=50K Married-AF-spouse 3.22
7 >50K Married-civ-spouse 3.20
8 >50K Never-married 2.40
9 >50K Divorced 1.92
In [27]:
px.bar(df_married, x='class', y='Percent', color='marital-status', title="class  w.r.t marital-status",
                    barmode='group', text='Percent')

Observation

  • People with Never-married status has the highest count percentage w.r.t class as <=50k
  • People with Married-civ-status has the highest count percentage w.r.t class as >50k
In [28]:
dt['occupation']=dt['occupation'].replace({'?':'Unknown'})
In [29]:
dt_occu = dt['occupation'].value_counts()
occupation = pd.DataFrame(data=dt_occu.index, columns=["occupation"])
occupation['values'] = dt_occu.values
In [30]:
fig = px.pie(occupation, values='values', names='occupation', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

Observation

  • For Occupation, Prof-speciality has the highest count
In [31]:
histogram(dt,"occupation","class",'class on occupation','occupation','Count')
In [32]:
df_occu=dt.groupby('class')['occupation'].value_counts(normalize=True)
df_occu = df_occu.mul(100).rename('Percent').reset_index()
df_occu['Percent']=df_occu['Percent'].round(decimals=2)
df_occu.head(10)
Out[32]:
class occupation Percent
0 <=50K Adm-clerical 13.03
1 <=50K Craft-repair 12.73
2 <=50K Other-service 12.70
3 <=50K Sales 10.84
4 <=50K Prof-specialty 9.12
5 <=50K Exec-managerial 8.55
6 <=50K Machine-op-inspct 7.13
7 <=50K Unknown 6.85
8 <=50K Handlers-cleaners 5.21
9 <=50K Transport-moving 5.04
In [33]:
px.bar(df_occu, x='class', y='Percent', color='occupation', title="class  w.r.t occupation",
                    barmode='group', text='Percent')

Observation

  • People with Adm-clerical occupation has the highest count percentage w.r.t class as <=50k
  • People with Exec-Managerial occupation the highest count percentage w.r.t class as >50k
In [34]:
dt_rel = dt['relationship'].value_counts()
relationship = pd.DataFrame(data=dt_rel.index, columns=["relationship"])
relationship['values'] = dt_rel.values
In [35]:
fig = px.pie(relationship, values='values', names='relationship', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()
In [36]:
histogram(dt,"relationship","class",'class on relationship','relationship','Count')
In [37]:
df_rel=dt.groupby('class')['relationship'].value_counts(normalize=True)
df_rel = df_rel.mul(100).rename('Percent').reset_index()
df_rel['Percent']=df_rel['Percent'].round(decimals=2)
df_rel.head(10)
Out[37]:
class relationship Percent
0 <=50K Not-in-family 30.43
1 <=50K Husband 29.26
2 <=50K Own-child 20.10
3 <=50K Unmarried 12.96
4 <=50K Other-relative 3.91
5 <=50K Wife 3.33
6 >50K Husband 75.69
7 >50K Not-in-family 10.92
8 >50K Wife 9.35
9 >50K Unmarried 2.64
In [38]:
px.bar(df_rel, x='class', y='Percent', color='relationship', title="class  w.r.t relationship",
                    barmode='group', text='Percent')
In [39]:
dt_race = dt['race'].value_counts()
race = pd.DataFrame(data=dt_race.index, columns=["race"])
race['values'] = dt_race.values
In [40]:
fig = px.pie(race, values='values', names='race', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

Observation

  • For Race, White has the highest count
In [41]:
histogram(dt,"race","class",'class on race','race','Count')
In [42]:
df_race=dt.groupby('class')['race'].value_counts(normalize=True)
df_race = df_race.mul(100).rename('Percent').reset_index()
df_race['Percent']=df_edu['Percent'].round(decimals=2)
df_race.head(10)
Out[42]:
class race Percent
0 <=50K White 35.74
1 <=50K Black 23.72
2 <=50K Asian-Pac-Islander 12.68
3 <=50K Amer-Indian-Eskimo 4.63
4 <=50K Other 4.14
5 >50K White 3.50
6 >50K Black 3.22
7 >50K Asian-Pac-Islander 3.20
8 >50K Amer-Indian-Eskimo 2.40
9 >50K Other 1.92
In [43]:
px.bar(df_race, x='class', y='Percent', color='race', title="class  w.r.t race",
                    barmode='group', text='Percent')

Observation

  • As the race white has the major count percentage so for the class white race has the majority count
In [44]:
dt_sex = dt['sex'].value_counts()
sex = pd.DataFrame(data=dt_sex.index, columns=["sex"])
sex['values'] = dt_sex.values
In [45]:
fig = px.pie(sex, values='values', names='sex', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()
In [46]:
histogram(dt,"sex","class",'class on sex','sex','Count')
In [47]:
dt_coun = dt['native-country'].value_counts()
country = pd.DataFrame(data=dt_coun.index, columns=["native-country"])
country['values'] = dt_coun.values
In [48]:
fig = px.pie(country, values='values', names='native-country', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()
In [49]:
histogram(dt,"native-country","class",'class on native-country','native-country','Count')
In [50]:
df_edu=dt.groupby('class')['native-country'].value_counts(normalize=True)
df_edu = df_edu.mul(100).rename('Percent').reset_index()
df_edu['Percent']=df_edu['Percent'].round(decimals=2)
df_edu.head(10)
Out[50]:
class native-country Percent
0 <=50K United-States 89.19
1 <=50K Mexico 2.43
2 <=50K ? 1.71
3 <=50K Philippines 0.57
4 <=50K Puerto-Rico 0.44
5 <=50K Germany 0.40
6 <=50K El-Salvador 0.39
7 <=50K Canada 0.32
8 <=50K Cuba 0.28
9 <=50K Dominican-Republic 0.26
In [51]:
px.bar(df_edu, x='class', y='Percent', color='native-country', title="class  w.r.t native-country",
                    barmode='group', text='Percent')

Feature-Selection

  • There are many attributs which are object type. By using get dummy,ordinal label encoding I will conver them to interger type
In [52]:
Sex=pd.get_dummies(dt.sex,drop_first=True,prefix='Sex')
In [53]:
dt=pd.concat([dt,Sex],axis=1)
In [54]:
dt.drop('sex',axis=1,inplace=True)
In [55]:
dt.head()
Out[55]:
age workclass fnlwgt education education-num marital-status occupation relationship race capitalgain capitalloss hoursperweek native-country class Sex_Male
0 2 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White 1 0 2 United-States <=50K 1
1 3 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White 0 0 0 United-States <=50K 1
2 2 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White 0 0 2 United-States <=50K 1
3 3 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black 0 0 2 United-States <=50K 1
4 1 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black 0 0 2 Cuba <=50K 0
In [56]:
dt.workclass.unique()
Out[56]:
array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', 'Unknown', 'Self-emp-inc', 'Without-pay',
       'Never-worked'], dtype=object)
In [57]:
ordinal_label=dt.groupby(['workclass'])['class'].count().sort_values().index
In [58]:
ordinal_label
Out[58]:
Index(['Never-worked', 'Without-pay', 'Federal-gov', 'Self-emp-inc',
       'State-gov', 'Unknown', 'Local-gov', 'Self-emp-not-inc', 'Private'],
      dtype='object', name='workclass')
In [59]:
list(enumerate(ordinal_label))
Out[59]:
[(0, 'Never-worked'),
 (1, 'Without-pay'),
 (2, 'Federal-gov'),
 (3, 'Self-emp-inc'),
 (4, 'State-gov'),
 (5, 'Unknown'),
 (6, 'Local-gov'),
 (7, 'Self-emp-not-inc'),
 (8, 'Private')]
In [60]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[60]:
{'Never-worked': 0,
 'Without-pay': 1,
 'Federal-gov': 2,
 'Self-emp-inc': 3,
 'State-gov': 4,
 'Unknown': 5,
 'Local-gov': 6,
 'Self-emp-not-inc': 7,
 'Private': 8}
In [61]:
dt['workclass_map']=dt['workclass'].map(ordinal_labels2)
In [62]:
dt.drop('workclass',axis=1,inplace=True)
In [63]:
dt.head()
Out[63]:
age fnlwgt education education-num marital-status occupation relationship race capitalgain capitalloss hoursperweek native-country class Sex_Male workclass_map
0 2 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White 1 0 2 United-States <=50K 1 4
1 3 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White 0 0 0 United-States <=50K 1 7
2 2 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White 0 0 2 United-States <=50K 1 8
3 3 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black 0 0 2 United-States <=50K 1 8
4 1 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black 0 0 2 Cuba <=50K 0 8
In [64]:
dt.education.unique()
Out[64]:
array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)
In [65]:
ordinal_label=dt.groupby(['education'])['class'].count().sort_values().index
ordinal_label
Out[65]:
Index(['Preschool', '1st-4th', '5th-6th', 'Doctorate', '12th', '9th',
       'Prof-school', '7th-8th', '10th', 'Assoc-acdm', '11th', 'Assoc-voc',
       'Masters', 'Bachelors', 'Some-college', 'HS-grad'],
      dtype='object', name='education')
In [66]:
list(enumerate(ordinal_label))
Out[66]:
[(0, 'Preschool'),
 (1, '1st-4th'),
 (2, '5th-6th'),
 (3, 'Doctorate'),
 (4, '12th'),
 (5, '9th'),
 (6, 'Prof-school'),
 (7, '7th-8th'),
 (8, '10th'),
 (9, 'Assoc-acdm'),
 (10, '11th'),
 (11, 'Assoc-voc'),
 (12, 'Masters'),
 (13, 'Bachelors'),
 (14, 'Some-college'),
 (15, 'HS-grad')]
In [67]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[67]:
{'Preschool': 0,
 '1st-4th': 1,
 '5th-6th': 2,
 'Doctorate': 3,
 '12th': 4,
 '9th': 5,
 'Prof-school': 6,
 '7th-8th': 7,
 '10th': 8,
 'Assoc-acdm': 9,
 '11th': 10,
 'Assoc-voc': 11,
 'Masters': 12,
 'Bachelors': 13,
 'Some-college': 14,
 'HS-grad': 15}
In [68]:
dt['education_map']=dt['education'].map(ordinal_labels2)
In [69]:
dt.drop('education',axis=1,inplace=True)
In [70]:
dt.head()
Out[70]:
age fnlwgt education-num marital-status occupation relationship race capitalgain capitalloss hoursperweek native-country class Sex_Male workclass_map education_map
0 2 77516 13 Never-married Adm-clerical Not-in-family White 1 0 2 United-States <=50K 1 4 13
1 3 83311 13 Married-civ-spouse Exec-managerial Husband White 0 0 0 United-States <=50K 1 7 13
2 2 215646 9 Divorced Handlers-cleaners Not-in-family White 0 0 2 United-States <=50K 1 8 15
3 3 234721 7 Married-civ-spouse Handlers-cleaners Husband Black 0 0 2 United-States <=50K 1 8 10
4 1 338409 13 Married-civ-spouse Prof-specialty Wife Black 0 0 2 Cuba <=50K 0 8 13
In [71]:
dt['marital-status'].unique()
Out[71]:
array(['Never-married', 'Married-civ-spouse', 'Divorced',
       'Married-spouse-absent', 'Separated', 'Married-AF-spouse',
       'Widowed'], dtype=object)
In [72]:
ordinal_label=dt.groupby(['marital-status'])['class'].count().sort_values().index
ordinal_label
Out[72]:
Index(['Married-AF-spouse', 'Married-spouse-absent', 'Widowed', 'Separated',
       'Divorced', 'Never-married', 'Married-civ-spouse'],
      dtype='object', name='marital-status')
In [73]:
list(enumerate(ordinal_label))
Out[73]:
[(0, 'Married-AF-spouse'),
 (1, 'Married-spouse-absent'),
 (2, 'Widowed'),
 (3, 'Separated'),
 (4, 'Divorced'),
 (5, 'Never-married'),
 (6, 'Married-civ-spouse')]
In [74]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[74]:
{'Married-AF-spouse': 0,
 'Married-spouse-absent': 1,
 'Widowed': 2,
 'Separated': 3,
 'Divorced': 4,
 'Never-married': 5,
 'Married-civ-spouse': 6}
In [75]:
dt['marital_map']=dt['marital-status'].map(ordinal_labels2)
In [76]:
dt.drop('marital-status',axis=1,inplace=True)
In [77]:
dt.head()
Out[77]:
age fnlwgt education-num occupation relationship race capitalgain capitalloss hoursperweek native-country class Sex_Male workclass_map education_map marital_map
0 2 77516 13 Adm-clerical Not-in-family White 1 0 2 United-States <=50K 1 4 13 5
1 3 83311 13 Exec-managerial Husband White 0 0 0 United-States <=50K 1 7 13 6
2 2 215646 9 Handlers-cleaners Not-in-family White 0 0 2 United-States <=50K 1 8 15 4
3 3 234721 7 Handlers-cleaners Husband Black 0 0 2 United-States <=50K 1 8 10 6
4 1 338409 13 Prof-specialty Wife Black 0 0 2 Cuba <=50K 0 8 13 6
In [78]:
dt['occupation']=dt['occupation'].replace({'?':'Unknown'})
In [79]:
dt['occupation'].unique()
Out[79]:
array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',
       'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',
       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',
       'Tech-support', 'Unknown', 'Protective-serv', 'Armed-Forces',
       'Priv-house-serv'], dtype=object)
In [80]:
ordinal_label=dt.groupby(['occupation'])['class'].count().sort_values().index
ordinal_label
Out[80]:
Index(['Armed-Forces', 'Priv-house-serv', 'Protective-serv', 'Tech-support',
       'Farming-fishing', 'Handlers-cleaners', 'Transport-moving', 'Unknown',
       'Machine-op-inspct', 'Other-service', 'Sales', 'Adm-clerical',
       'Exec-managerial', 'Craft-repair', 'Prof-specialty'],
      dtype='object', name='occupation')
In [81]:
list(enumerate(ordinal_label))
Out[81]:
[(0, 'Armed-Forces'),
 (1, 'Priv-house-serv'),
 (2, 'Protective-serv'),
 (3, 'Tech-support'),
 (4, 'Farming-fishing'),
 (5, 'Handlers-cleaners'),
 (6, 'Transport-moving'),
 (7, 'Unknown'),
 (8, 'Machine-op-inspct'),
 (9, 'Other-service'),
 (10, 'Sales'),
 (11, 'Adm-clerical'),
 (12, 'Exec-managerial'),
 (13, 'Craft-repair'),
 (14, 'Prof-specialty')]
In [82]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[82]:
{'Armed-Forces': 0,
 'Priv-house-serv': 1,
 'Protective-serv': 2,
 'Tech-support': 3,
 'Farming-fishing': 4,
 'Handlers-cleaners': 5,
 'Transport-moving': 6,
 'Unknown': 7,
 'Machine-op-inspct': 8,
 'Other-service': 9,
 'Sales': 10,
 'Adm-clerical': 11,
 'Exec-managerial': 12,
 'Craft-repair': 13,
 'Prof-specialty': 14}
In [83]:
dt['occupation_map']=dt['occupation'].map(ordinal_labels2)
In [84]:
dt.drop('occupation',axis=1,inplace=True)
In [85]:
dt.head()
Out[85]:
age fnlwgt education-num relationship race capitalgain capitalloss hoursperweek native-country class Sex_Male workclass_map education_map marital_map occupation_map
0 2 77516 13 Not-in-family White 1 0 2 United-States <=50K 1 4 13 5 11
1 3 83311 13 Husband White 0 0 0 United-States <=50K 1 7 13 6 12
2 2 215646 9 Not-in-family White 0 0 2 United-States <=50K 1 8 15 4 5
3 3 234721 7 Husband Black 0 0 2 United-States <=50K 1 8 10 6 5
4 1 338409 13 Wife Black 0 0 2 Cuba <=50K 0 8 13 6 14
In [86]:
dt['relationship'].unique()
Out[86]:
array(['Not-in-family', 'Husband', 'Wife', 'Own-child', 'Unmarried',
       'Other-relative'], dtype=object)
In [87]:
ordinal_label=dt.groupby(['relationship'])['class'].count().sort_values().index
ordinal_label
Out[87]:
Index(['Other-relative', 'Wife', 'Unmarried', 'Own-child', 'Not-in-family',
       'Husband'],
      dtype='object', name='relationship')
In [88]:
list(enumerate(ordinal_label))
Out[88]:
[(0, 'Other-relative'),
 (1, 'Wife'),
 (2, 'Unmarried'),
 (3, 'Own-child'),
 (4, 'Not-in-family'),
 (5, 'Husband')]
In [89]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[89]:
{'Other-relative': 0,
 'Wife': 1,
 'Unmarried': 2,
 'Own-child': 3,
 'Not-in-family': 4,
 'Husband': 5}
In [90]:
dt['relationship_map']=dt['relationship'].map(ordinal_labels2)
In [91]:
dt.drop('relationship',axis=1,inplace=True)
In [92]:
dt.head()
Out[92]:
age fnlwgt education-num race capitalgain capitalloss hoursperweek native-country class Sex_Male workclass_map education_map marital_map occupation_map relationship_map
0 2 77516 13 White 1 0 2 United-States <=50K 1 4 13 5 11 4
1 3 83311 13 White 0 0 0 United-States <=50K 1 7 13 6 12 5
2 2 215646 9 White 0 0 2 United-States <=50K 1 8 15 4 5 4
3 3 234721 7 Black 0 0 2 United-States <=50K 1 8 10 6 5 5
4 1 338409 13 Black 0 0 2 Cuba <=50K 0 8 13 6 14 1
In [93]:
dt.race.unique()
Out[93]:
array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
       'Other'], dtype=object)
In [94]:
ordinal_label=dt.groupby(['race'])['class'].count().sort_values().index
ordinal_label
Out[94]:
Index(['Other', 'Amer-Indian-Eskimo', 'Asian-Pac-Islander', 'Black', 'White'], dtype='object', name='race')
In [95]:
list(enumerate(ordinal_label))
Out[95]:
[(0, 'Other'),
 (1, 'Amer-Indian-Eskimo'),
 (2, 'Asian-Pac-Islander'),
 (3, 'Black'),
 (4, 'White')]
In [96]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[96]:
{'Other': 0,
 'Amer-Indian-Eskimo': 1,
 'Asian-Pac-Islander': 2,
 'Black': 3,
 'White': 4}
In [97]:
dt['race_map']=dt['race'].map(ordinal_labels2)
In [98]:
dt.drop('race',axis=1,inplace=True)
In [99]:
dt.head()
Out[99]:
age fnlwgt education-num capitalgain capitalloss hoursperweek native-country class Sex_Male workclass_map education_map marital_map occupation_map relationship_map race_map
0 2 77516 13 1 0 2 United-States <=50K 1 4 13 5 11 4 4
1 3 83311 13 0 0 0 United-States <=50K 1 7 13 6 12 5 4
2 2 215646 9 0 0 2 United-States <=50K 1 8 15 4 5 4 4
3 3 234721 7 0 0 2 United-States <=50K 1 8 10 6 5 5 3
4 1 338409 13 0 0 2 Cuba <=50K 0 8 13 6 14 1 3
In [100]:
dt['native-country'].unique()
Out[100]:
array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)
In [101]:
ordinal_label=dt.groupby(['native-country'])['class'].count().sort_values().index
ordinal_label
Out[101]:
Index(['Holand-Netherlands', 'Hungary', 'Honduras', 'Scotland', 'Laos',
       'Outlying-US(Guam-USVI-etc)', 'Yugoslavia', 'Trinadad&Tobago',
       'Cambodia', 'Thailand', 'Hong', 'Ireland', 'France', 'Ecuador', 'Peru',
       'Greece', 'Nicaragua', 'Iran', 'Taiwan', 'Portugal', 'Haiti',
       'Columbia', 'Vietnam', 'Poland', 'Guatemala', 'Japan',
       'Dominican-Republic', 'Italy', 'Jamaica', 'South', 'China', 'England',
       'Cuba', 'India', 'El-Salvador', 'Canada', 'Puerto-Rico', 'Germany',
       'Philippines', '?', 'Mexico', 'United-States'],
      dtype='object', name='native-country')
In [102]:
list(enumerate(ordinal_label))
Out[102]:
[(0, 'Holand-Netherlands'),
 (1, 'Hungary'),
 (2, 'Honduras'),
 (3, 'Scotland'),
 (4, 'Laos'),
 (5, 'Outlying-US(Guam-USVI-etc)'),
 (6, 'Yugoslavia'),
 (7, 'Trinadad&Tobago'),
 (8, 'Cambodia'),
 (9, 'Thailand'),
 (10, 'Hong'),
 (11, 'Ireland'),
 (12, 'France'),
 (13, 'Ecuador'),
 (14, 'Peru'),
 (15, 'Greece'),
 (16, 'Nicaragua'),
 (17, 'Iran'),
 (18, 'Taiwan'),
 (19, 'Portugal'),
 (20, 'Haiti'),
 (21, 'Columbia'),
 (22, 'Vietnam'),
 (23, 'Poland'),
 (24, 'Guatemala'),
 (25, 'Japan'),
 (26, 'Dominican-Republic'),
 (27, 'Italy'),
 (28, 'Jamaica'),
 (29, 'South'),
 (30, 'China'),
 (31, 'England'),
 (32, 'Cuba'),
 (33, 'India'),
 (34, 'El-Salvador'),
 (35, 'Canada'),
 (36, 'Puerto-Rico'),
 (37, 'Germany'),
 (38, 'Philippines'),
 (39, '?'),
 (40, 'Mexico'),
 (41, 'United-States')]
In [103]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_label,0)}
ordinal_labels2
Out[103]:
{'Holand-Netherlands': 0,
 'Hungary': 1,
 'Honduras': 2,
 'Scotland': 3,
 'Laos': 4,
 'Outlying-US(Guam-USVI-etc)': 5,
 'Yugoslavia': 6,
 'Trinadad&Tobago': 7,
 'Cambodia': 8,
 'Thailand': 9,
 'Hong': 10,
 'Ireland': 11,
 'France': 12,
 'Ecuador': 13,
 'Peru': 14,
 'Greece': 15,
 'Nicaragua': 16,
 'Iran': 17,
 'Taiwan': 18,
 'Portugal': 19,
 'Haiti': 20,
 'Columbia': 21,
 'Vietnam': 22,
 'Poland': 23,
 'Guatemala': 24,
 'Japan': 25,
 'Dominican-Republic': 26,
 'Italy': 27,
 'Jamaica': 28,
 'South': 29,
 'China': 30,
 'England': 31,
 'Cuba': 32,
 'India': 33,
 'El-Salvador': 34,
 'Canada': 35,
 'Puerto-Rico': 36,
 'Germany': 37,
 'Philippines': 38,
 '?': 39,
 'Mexico': 40,
 'United-States': 41}
In [104]:
dt['native-country_map']=dt['native-country'].map(ordinal_labels2)
In [105]:
dt.drop('native-country',axis=1,inplace=True)
In [106]:
dt.head()
Out[106]:
age fnlwgt education-num capitalgain capitalloss hoursperweek class Sex_Male workclass_map education_map marital_map occupation_map relationship_map race_map native-country_map
0 2 77516 13 1 0 2 <=50K 1 4 13 5 11 4 4 41
1 3 83311 13 0 0 0 <=50K 1 7 13 6 12 5 4 41
2 2 215646 9 0 0 2 <=50K 1 8 15 4 5 4 4 41
3 3 234721 7 0 0 2 <=50K 1 8 10 6 5 5 3 41
4 1 338409 13 0 0 2 <=50K 0 8 13 6 14 1 3 32
In [107]:
dt.corr()
Out[107]:
age fnlwgt education-num capitalgain capitalloss hoursperweek Sex_Male workclass_map education_map marital_map occupation_map relationship_map race_map native-country_map
age 1.000000 -0.076674 0.034859 0.124929 0.060768 0.115442 0.090898 -0.140747 -0.079607 -0.019098 0.068226 0.233844 0.038893 -0.006591
fnlwgt -0.076674 1.000000 -0.038761 -0.004681 -0.004643 -0.008893 0.027739 0.032907 -0.031603 -0.004719 -0.011290 -0.016904 -0.003802 -0.020118
education-num 0.034859 -0.038761 1.000000 0.160389 0.084891 0.146786 0.009328 -0.127880 0.176743 0.100430 0.316785 0.105458 0.037397 0.030299
capitalgain 0.124929 -0.004681 0.160389 1.000000 -0.055408 0.099180 0.070443 -0.051983 -0.050346 0.083750 0.091530 0.085162 0.021527 0.011356
capitalloss 0.060768 -0.004643 0.084891 -0.055408 1.000000 0.056712 0.046633 -0.027307 -0.021970 0.050783 0.045576 0.055296 0.017419 0.002629
hoursperweek 0.115442 -0.008893 0.146786 0.099180 0.056712 1.000000 0.238820 -0.010195 0.018762 0.148672 0.085183 0.237633 0.033770 0.002810
Sex_Male 0.090898 0.027739 0.009328 0.070443 0.046633 0.238820 1.000000 -0.017523 -0.042150 0.404052 -0.030488 0.552630 0.066100 0.004589
workclass_map -0.140747 0.032907 -0.127880 -0.051983 -0.027307 -0.010195 -0.017523 1.000000 0.022475 -0.025797 0.001155 -0.045917 0.022345 -0.020590
education_map -0.079607 -0.031603 0.176743 -0.050346 -0.021970 0.018762 -0.042150 0.022475 1.000000 -0.004997 -0.014477 -0.016023 0.031858 0.081747
marital_map -0.019098 -0.004719 0.100430 0.083750 0.050783 0.148672 0.404052 -0.025797 -0.004997 1.000000 0.047331 0.407818 0.067408 0.017501
occupation_map 0.068226 -0.011290 0.316785 0.091530 0.045576 0.085183 -0.030488 0.001155 -0.014477 0.047331 1.000000 0.054248 0.032234 -0.002051
relationship_map 0.233844 -0.016904 0.105458 0.085162 0.055296 0.237633 0.552630 -0.045917 -0.016023 0.407818 0.054248 1.000000 0.114210 0.041696
race_map 0.038893 -0.003802 0.037397 0.021527 0.017419 0.033770 0.066100 0.022345 0.031858 0.067408 0.032234 0.114210 1.000000 0.214065
native-country_map -0.006591 -0.020118 0.030299 0.011356 0.002629 0.002810 0.004589 -0.020590 0.081747 0.017501 -0.002051 0.041696 0.214065 1.000000
In [108]:
plt.figure(figsize=(20,20))
corr=dt.corr()
sns.heatmap(corr,annot=True,cmap=plt.cm.CMRmap_r)
plt.show()
In [109]:
def correlation_feature(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr
In [110]:
corr_features = correlation_feature(dt, 0.5)
len(set(corr_features))
Out[110]:
1
In [81]:
corr_features
Out[81]:
{'relationship_map'}
In [82]:
zerovar=dt.var()[dt.var()==0].index.values
In [83]:
zerovar
Out[83]:
array([], dtype=object)
In [84]:
plt.figure(figsize=(8,8))
sns.countplot('class',data=dt)
plt.show()

Observation

  • From this plot we can clearly see that the Class column which is our target column is imbalanced
In [85]:
x=dt.drop('class',axis=1)
y=dt['class']
In [89]:
#from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
smk = SMOTETomek()
X_res,y_res=smk.fit_resample(x,y)
In [90]:
plt.figure(figsize=(8,8))
sns.countplot(y_res,data=dt)
plt.show()

Observation

  • From this plot we can clearly see that the Class column which is our target column is balanced now
  • We balanced this by using Smote over sampling technique

Model Development

  • In this section I will use differnt machine learning algorithm to create a model which will predict wheater a customer will earn <=50k or >50k

Performance Metric

  • I am going to use Confusion matrix and accuracy score to check how accurately my model is predicting when a new data is fed to the model
In [91]:
x_train,x_test,y_train,y_test=train_test_split(X_res,y_res,test_size=0.3,random_state=42)
In [92]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape
Out[92]:
((46758, 14), (46758,), (20040, 14), (20040,))

Logistic Regression

In [93]:
log_reg=LogisticRegression()
log_reg.fit(x_train,y_train)
log_pred=log_reg.predict(x_test)
In [94]:
cm1=confusion_matrix(y_test,log_pred)
sns.heatmap(cm1,annot=True,fmt='d')
Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x2e35c0739e8>
In [95]:
print(accuracy_score(y_test,log_pred))
print(classification_report(y_test,log_pred))
0.49945109780439123
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
              precision    recall  f1-score   support

       <=50K       0.50      1.00      0.67     10009
        >50K       0.00      0.00      0.00     10031

    accuracy                           0.50     20040
   macro avg       0.25      0.50      0.33     20040
weighted avg       0.25      0.50      0.33     20040

C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Observation

  • As We know when the accuracy score is high it shows that our model is working fine
  • Here the accuracy score is only 49% hence we will try different algorithm to train our model

RandomForestClassifier

In [98]:
model_rand=RandomForestClassifier()
model_rand.fit(x_train,y_train)
model_rand_test=model_rand.predict(x_test)
In [99]:
cm1=confusion_matrix(y_test,model_rand_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x2e35c0ba908>
In [100]:
print('Accuracy Score:',accuracy_score(y_test,model_rand_test))
print(classification_report(y_test,model_rand_test))
Accuracy Score: 0.8713073852295409
              precision    recall  f1-score   support

       <=50K       0.87      0.87      0.87     10009
        >50K       0.87      0.88      0.87     10031

    accuracy                           0.87     20040
   macro avg       0.87      0.87      0.87     20040
weighted avg       0.87      0.87      0.87     20040

Observation

  • After applying random forest we can see the accuracy score has improved but we will try to improve more
In [105]:
cl=pd.get_dummies(dt['class'],drop_first=True,prefix='Class')
In [106]:
dt=pd.concat([dt,cl],axis=1)
dt.drop('class',axis=1,inplace=True)
In [107]:
dt.head()
Out[107]:
age fnlwgt education-num capitalgain capitalloss hoursperweek Sex_Male workclass_map education_map marital_map occupation_map relationship_map race_map native-country_map Class_>50K
0 2 77516 13 1 0 2 1 4 13 5 11 4 4 41 0
1 3 83311 13 0 0 0 1 7 13 6 12 5 4 41 0
2 2 215646 9 0 0 2 1 8 15 4 5 4 4 41 0
3 3 234721 7 0 0 2 1 8 10 6 5 5 3 41 0
4 1 338409 13 0 0 2 0 8 13 6 14 1 3 32 0
  • I will apply scalling and the re run the whole process to check wheather we got better accuracy or not
In [108]:
min_dt=dt.min()
range_dt=(dt-min_dt).max()
dt_scaled = (dt-min_dt)/range_dt
In [109]:
dt_scaled.head()
Out[109]:
age fnlwgt education-num capitalgain capitalloss hoursperweek Sex_Male workclass_map education_map marital_map occupation_map relationship_map race_map native-country_map Class_>50K
0 0.50 0.044131 0.800000 0.25 0.0 0.5 1.0 0.500 0.866667 0.833333 0.785714 0.8 1.00 1.000000 0.0
1 0.75 0.048052 0.800000 0.00 0.0 0.0 1.0 0.875 0.866667 1.000000 0.857143 1.0 1.00 1.000000 0.0
2 0.50 0.137581 0.533333 0.00 0.0 0.5 1.0 1.000 1.000000 0.666667 0.357143 0.8 1.00 1.000000 0.0
3 0.75 0.150486 0.400000 0.00 0.0 0.5 1.0 1.000 0.666667 1.000000 0.357143 1.0 0.75 1.000000 0.0
4 0.25 0.220635 0.800000 0.00 0.0 0.5 0.0 1.000 0.866667 1.000000 1.000000 0.2 0.75 0.780488 0.0
In [110]:
plt.figure(figsize=(8,8))
sns.countplot('Class_>50K',data=dt_scaled)
plt.show()
In [111]:
x=dt_scaled.drop('Class_>50K',axis=1)
y=dt_scaled['Class_>50K']
In [112]:
smk = SMOTETomek()
X_res,y_res=smk.fit_resample(x,y)
In [113]:
plt.figure(figsize=(8,8))
sns.countplot(y_res,data=dt_scaled)
plt.show()
In [114]:
x_train,x_test,y_train,y_test=train_test_split(X_res,y_res,test_size=0.3,random_state=42)
In [115]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape
Out[115]:
((49219, 14), (49219,), (21095, 14), (21095,))
In [116]:
log_reg=LogisticRegression()
log_reg.fit(x_train,y_train)
log_pred=log_reg.predict(x_test)
In [117]:
cm1=confusion_matrix(y_test,log_pred)
sns.heatmap(cm1,annot=True,fmt='d')
Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x2e35ab772e8>
In [118]:
print(accuracy_score(y_test,log_pred))
print(classification_report(y_test,log_pred))
0.8255984830528561
              precision    recall  f1-score   support

         0.0       0.83      0.81      0.82     10493
         1.0       0.82      0.84      0.83     10602

    accuracy                           0.83     21095
   macro avg       0.83      0.83      0.83     21095
weighted avg       0.83      0.83      0.83     21095

Observation

  • After applying scalling to our dataset we can see the accuracy has improved for logistic regression model
In [119]:
model_rand=RandomForestClassifier()
model_rand.fit(x_train,y_train)
model_rand_test=model_rand.predict(x_test)
In [120]:
cm1=confusion_matrix(y_test,model_rand_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x2e367c7ce48>
In [121]:
print('Accuracy Score:',accuracy_score(y_test,model_rand_test))
print(classification_report(y_test,model_rand_test))
Accuracy Score: 0.9176108082484001
              precision    recall  f1-score   support

         0.0       0.92      0.91      0.92     10493
         1.0       0.91      0.92      0.92     10602

    accuracy                           0.92     21095
   macro avg       0.92      0.92      0.92     21095
weighted avg       0.92      0.92      0.92     21095

Observation

  • So after scalling the dataset we get to see a improvement in accuracy score for our Random forest model
  • So we can conclude that the model we created using Random forest algorith with the scalled data is the best model for our data set
In [ ]: